Table of Contents
In this chapter we'll begin to explore the goals, approaches, and challenges for creating phylogenetic trees, or phylogenies. Phylogenies, such as the two presented in Figure 1, represent hypotheses about the evolutionary history of a groups, who are represented by the tips in the tree. You can explore an interactive version of the three-domain tree presented in Figure 1b online, through the Interactive Tree of Life project.
|
|
| Figure 1a: Evolutionary tree presented by Charles Darwin in On the Origin of Species. Figure 1b: A hypothesis of evolutionary relationships between the three domains of life. This image was created by the Norman Pace Laboratory. | |
Reconstructing the phylogeny of a group of individuals is useful for many reasons. Probably the most obvious of these is understanding the evolutionary relationship between a group of organisms. For example, over the past half-century we've gained great insight into the evolution of our species, Homo sapiens, by studying features of our closest relatives, both extant (still existing) organisms, such as the Pan (chimpanzees) and Gorilla genera, and extinct (no longer living) species, including Homo neanderthalensis, Homo erectus, Homo habilus, and many species in the Australopithecus genus. (The Smithsonian Museum's Human Origins Initiative is an excellent resource for learning more about the fascinating subject of human evolution.)
For example, Woese and Fox (1977) used phylogenetic reconstruction to first illustrate that the "prokaryotes" really represented two ancient lineages which they called the eubacteria and the archaebacteria, which ultimately led to the proposal of a "three domain" tree of life composed of the three deep branching linages, the archaea, the bacteria, and the eucarya (Woese, Kandler and Wheelis (1990)).
Phylogenetic trees such as these are also useful for understanding evolution itself.
The individuals represented at the tips of our trees don't necessarily have to be organisms though. In another important application of phylogenetic trees, we can study the evolution of genes, which can help us gain a deeper understanding of gene function. Specifically, we can learn about families of related genes.
A classic example of this is the globin family, which includes the proteins hemoglobin and myoglobin, molecules that can reversibly bind oxygen (meaning they can bind to it, and then let go of it). You've probably heard of hemoglobin (if not globins in general), as this molecule binds to oxygen where it is present in high concentration (such as in your lung) and releases it where it is present in low concentration (such as in the bicep, where it is ultimately used to power your arm). Hemoglobin and myoglobin are paralogs, meaning that they are related by a gene duplication and subsequent divergence. If you were to compare an unknown globin sequence to either of these you could detect homology, but a tree such as the one present in Figure 2, would help you understand the type of homologous relationship (i.e., whether it was orthology or paralogy).
An additional application that we'll cover in this text is comparing the composition of communities of organisms, but we'll come back to that later.
Phylogenies are reconstructed using a variety of different algorithms, some of which we'll cover in this chapter. These algorithms all work by comparing a set of features of organisms, and inferring the evolutionary distance between those organisms based on the similarity of their features. The features that are compared can be nearly anything that is observable, either from extant organisms or fossilized representatives of extinct organisms.
As an example, let's consider the reconstruction of the phylogeny of spiders (the order Araneae), a hypothesis of which is presented in Figure 3. Of the extant spiders, some are orb-weavers (meaning they spin circular, sticky webs), and others are not. Entomologists have debated whether orb-weaving is a monophyletic trait (meaning that it evolved one time), or whether it is polyphyletic (meaning that it evolved multiple times, such as flight, which has evolved independently in birds, flying dinosaurs, insects, and mammals). If orb-weaving is monophyletic, it would mean that over the course of evolution, extant spiders which don't weave orb webs have lost that ability. Some researchers doubt this as it's a very effective means of catching prey, and losing that ability would likely constitute an evolutionary disadvantage. If orb-weaving is polyphyletic, it would means that in at least two different spider lineages, this trait arose independently, which other researchers consider to be very unlikely due to the complexity of engineering these webs. Examples of the evolution of monophyletic and polyphyletic traits are presented in Figures 3 and 4, respectively.
Earlier work on understanding the relations between the spider lineages focused on comparing traits that entomologists would observe, for example by watching spiders in action or by dissecting them. For example, in 1986 through 1991, Johnathan Coddington published several studies that tabulated and compared about sixty features of 32 spider taxa (Coddington J. 1986. The monophyletic origin of the orb web. In: Shear W, ed. Spiders: webs, behavior, and evolution. Stanford, California: Stanford University Press. 319-363; Coddington JA. 1991. Cladistics and spider classification: araneomorph phylogeny and the monophyly of orbweavers (Araneae: Araneomorphae; Orbiculariae) Acta Zoologica Fennica 190:75-87). Features included whether a spider wraps its prey when it attacks (a behavioral trait), and how branched the spider's trachea is (a morphological trait). By determining which spiders were more similar and different across these traits, Dr. Coddington provided early evidence for the hypothesis that orb-weaving is an ancient monophyletic trait.
For example, (Garrison et al., 2016), compared approximately 700,000 amino acid positions from nearly 4000 loci around the genomes of 70 spider taxa. Compare the number of features here to the number mentioned in the previous paragraph. These phylogenomic studies have further supported the idea that orb-weaving is an ancient monophyletic trait, and have provided much finer scale information on the evolution of spiders. Supported by these data, researchers hypothesize that the loss of orb-weaving might not be that surprising. While it does provide an effective means of catching flying insects, many insects which are potential prey for spiders don't fly. Further, orb webs may attract predators of spiders, as they are easily observable signals of where a spider can be found.
For the remainder of this chapter, we'll consider methods for phylogenetic reconstruction that use genome sequence data as features.
We'll use all of these terms below as we begin to explore phylogenetic trees.
Table of Contents
Before we jump into how to reconstruct a phylogeny from DNA sequence data, we're going to perform a simulation of the process of evolution of a DNA sequence. In this simulation, we're going to model sequence evolution with a Python function, and then we're going to run that function to simulate multiple generations of evolution.
Bioinformatics developers often use simulations to understand how their algorithms work, as they uniquely provide an opportunity to know what the correct answer is. This provides a way to compare algorithms to each other to figure out which performs best under which circumstances. In our simulation we're going to have control over the starting sequence, and the probability of incurring a substitution mutation or an insertion/deletion mutation at each position of the sequence in each generation.
Our simulation will works as follows.
evolve_generations. Let's inspect this code and then run our simulation beginning with a random sequence.
%pylab inline
import numpy as np
import seaborn as sns
import random
#from IPython.core import page
#page.page = print
First we'll look at the function used to simulate the evolution of a single sequence. This is where most of the important evolutionary modeling of sequence evolution happens.
from iab.algorithms import evolve_sequence
%psource evolve_sequence
Next, take a look at the function that models a single generation of a single sequence. This is where the clonal reproduction (i.e., one parent sequence becoming two child sequences) occurs.
from iab.algorithms import evolve_generation
%psource evolve_generation
Finally, take a look at our entry point function. This is where we provide the parameters of our simulation, including the starting sequence, the number of generations, and the mutation probabilities. Notice how each of these functions builds on the prior functions.
from iab.algorithms import evolve_generations
%psource evolve_generations
Now we'll run our simulation. We'll start with a random DNA sequence, and then evolve three generations.
When we call evolve_generations, we'll pass the parameter verbose=True. This will tell the function to print out some information throughout the process. This will let us inspect our evolutionary process: something that is impossible to do with real sequences from history, though we can observe microbial evolution in the lab.
from iab.algorithms import random_sequence
import skbio
sequence = random_sequence(skbio.DNA, 50)
sequences = evolve_generations(sequence, generations=3,
substitution_probability=0.1,
indel_probability=0.05,
increased_rate_probability=0.1,
verbose=True)
We now have a new variable, sequences, which contains the child sequences from the last generation.
1 or 2. 0, the identifier of the last common ancestor (or our starting sequence) of all of the sequences. These identifiers will help us interpret whether the phylogenies that we reconstruct accurately represent the evolutionary relationships between the sequences.
Take a minute to compare the two sequences below. What types of mutations happened over the course of their evolution?
print(len(sequences))
sequences[0]
sequences[-1]
Let's simulate 10 generations of sequences here, and then randomly select some of those sequences to work with in the remaining sections of this chapter. For the sake of runtime, I'm going to set our indel_probability=0.0. This means that we won't need to align our selected sequences before constructing trees from them (because with only substitution mutations occurring, homologous positions always remain aligned). Multiple sequence alignment is currently a runtime bottleneck in IAB, so this will let us run this notebook much faster. If you'd like to model insertion/deletions, you can increase the indel_probability, say to 0.005. If you do that, the sequences will be aligned for you in the next cell, but it may take around 30 minutes to run.
from skbio.alignment import global_pairwise_align_nucleotide
from iab.algorithms import progressive_msa
from functools import partial
indel_probability = 0.0
sequences = evolve_generations(sequence,
generations=10,
substitution_probability=0.03,
indel_probability=0.0,
increased_rate_probability=0.1,
verbose=False)
sequences = random.sample(sequences, 25)
if indel_probability == 0:
sequences_aligned = sequences
else:
gpa = partial(global_pairwise_align_nucleotide,
penalize_terminal_gaps=True)
sequences_aligned = progressive_msa(sequences,
pairwise_aligner=gpa)
While simulations are extremely powerful for comparing algorithms, they can also be misleading. This is in part because when we model evolution we simplify the evolutionary process, and the model is stochastic. One assumption that our simulation is making is that "bursts" of evolution (i.e., where our substitution rate temporarily increases) are restricted to only a single generation. A child is no more or less like to have an increased rate of substitutions if its parent did. This may or may not be an accurate model. Imagine in the real-world that the environment changed drastically for some descendants (for example, if a geological event created new thermal vents in a lake they inhabited, resulting in an increase in mean water temperature), but not for others. The lineages who experience the environmental change might have an increased rate of substitutions as their genomes adapt to the new environment. The increased substitution rate may persist for multiple generations, or it may not. This is one limitation of our simulation that we know about, but because we don't have a perfect understanding of sequence evolution, there are limitations that we don't know about.
When using simulations, it's important to understand what assumptions the simulation makes so you know what it can tell you about and what it can't tell you about. You'll want to consider that when determining how confident you are in the results of an evaluation based on simulation.
On the opposite end of the spectrum from simulations for algorithm comparison is comparisons based on real data, where we don't know groud-truth.
The take-away message here is that neither approach is perfect, and often researchers will use a combination of simulated and real data to evaluate algorithms.
As we now start computing phylogenetic trees, we're going to need a way to visualize them. We'll use the ete3 Python package for this, and in the next cell we'll configure the TreeStyle which is used to define how the trees we visualize will look. If you'd like to experiment with other views, you can modify the code in this cell according to the ete3 documentation. If you come up with a nicer style, I'd be interested in seeing that in a pull request. You can post screenshots to IAB issue #213 before creating a pull request so I can see what the new style looks like.
import ete3
help(ete3.Tree)
ts = ete3.TreeStyle()
ts.show_leaf_name = True
ts.scale = 250
ts.branch_vertical_margin = 15
We can apply this TreeStyle to a random tree as follows. Any changes that you make to the TreeStyle above will impact this tree and all following trees in this chapter. Experiment with this - it's fun!
# Make a tree object
t = ete3.Tree()
# Generates a random tree of 10
t.populate(10)
# Plot the tree
t.render("%%inline", tree_style=ts)
Table of Contents
The next approaches we'll take for phylogenetic reconstruction rely on computing distances between sequences. We've previously discussed distances between sequences in a few places in the text. We'll begin this section by formalizing the term distance, and introducing the concept of a distance matrix.
An illustration comparing the taxicab metric to the Euclidean metric on the plane:
https://en.wikipedia.org/wiki/Metric_(mathematics)
https://en.wikipedia.org/wiki/Metric_space
https://en.wikipedia.org/wiki/Similarity_measure
Formally, a measure of dissimilarity $d$ between two objects $x$ and $y$ is a distance if it meets these four criteria for all $x$ and $y$:
When we compute the distances between some number of objects $n$, we'll commonly represent those values in a distance matrix which contains all of those values.
These distance matrices are so common in bioinformatics that scikit-bio defines a DistanceMatrix object that provides a convenient interface for working with these data.
from skbio import DistanceMatrix
from skbio.stats.distance._base import DissimilarityMatrix
#help(DistanceMatrix)
help(DissimilarityMatrix)
We can create one of these objects as follows:
dm = DistanceMatrix([[0.0, 1.0, 2.0],
[1.0, 0.0, 3.0],
[2.0, 3.0, 0.0]],
ids=['a', 'b', 'c'])
We can then access the values in the distance matrix directly, view the distance matrix as a heatmap, and do many other things that facilitate analyzing distances between objects.
print(dm['a', 'b'])
print(dm['b', 'c'])
help(dm.plot)
_ = dm.plot(cmap='Greens')
Which of the conditions listed above results in each of these features of distance matrices?
We've now looked at several ways of computing distances between sequences, some of which have required that the positions in the sequences are directly comparable to one another (i.e., that our sequences are aligned), and some of which haven't.
We can use the kmer_distance function with scikit-bio as follows to create an skbio.DistanceMatrix object.
from iab.algorithms import kmer_distance
%psource kmer_distance
help(DistanceMatrix.from_iterable)
print(sequences)
kmer_dm = DistanceMatrix.from_iterable(sequences,
metric=kmer_distance,
key='id')
print(kmer_dm)
_ = kmer_dm.plot(cmap='Greens',
title='3mer distances between sequences')
One alignment-based distance metric that we've looked at is Hamming distance. This would be considered an alignment-based approach because is does consider the order of the characters in the sequence by comparing a character at a position in one sequence only to the character at the corresponding position in the other sequence. We could compute these distances as follows, after first aligning our sequences.
Note: here, we specified a 0 probability of indel mutations, with sequences of identical length, and thus they are already aligned:
from skbio.sequence.distance import hamming
hamming_dm = DistanceMatrix.from_iterable(sequences_aligned,
metric=hamming,
key='id')
_ = hamming_dm.plot(cmap='Greens',
title='Hamming distances between sequences')
UPGMA is "ultrametric", meaning that all the terminal nodes (i.e. the sequences/taxa) are equally distance from the root. In molecular terms, this means that UPGMA assumes a molecular clock, i.e. all lineages are evolving at a constant rate. In practical terms, this means that you can construct a distance scale bar and all the terminal nodes will be level at position 0.0, representing the present. In this example, the scale bar is shown on the right-hand side.
A to C. C to T. C back to A (a back substitution), we would observe zero substitution mutations at that position even though two had occurred.The Python implementation of this correction looks like the following. We can apply this to a number of input distance values to understand how it transforms our Hamming distances.
def jc_correction(p):
return (-3 / 4) * np.log(1 - (4 * p / 3))
distances = np.arange(0, 0.70, 0.05)
jc_corrected_distances = list(map(jc_correction, distances))
ax = sns.pointplot(distances, jc_corrected_distances)
ax.set_xlabel('Hamming distance')
ax.set_ylabel('JC-corrected distance')
ax.set_xlim(0)
ax.set_ylim(0)
ax
We can then apply this to a full distance matrix as follows (we'll then print the first row of each).
def jc_correct_dm(dm):
result = np.zeros(dm.shape)
for i in range(dm.shape[0]):
for j in range(i):
result[i, j] = result[j, i] = jc_correction(dm[i, j])
return skbio.DistanceMatrix(result, ids=dm.ids)
jc_corrected_hamming_dm = jc_correct_dm(hamming_dm)
print(hamming_dm[0])
print(jc_corrected_hamming_dm[0])
# Pre-correction:
_ = hamming_dm.plot(cmap='Greens', title='Hamming distances between sequences')
# Post-correction:
_ = jc_corrected_hamming_dm.plot(cmap='Greens', title='JC-corrected Hamming distances between sequences')
Table of Contents
The first algorithm we'll look at for reconstructing phylogenetic trees is called UPGMA, which stands for Unweighted Pair-Group Method with Arithmetic mean. While that name sounds complex, it's actually a straightforward algorithm, which is why we're starting with it. After we work through the algorithm, we'll come back to the name as it'll make more sense then.
UPGMA is a generic hierarchical clustering algorithm. It's not specific to reconstructing biological trees, but rather is used for interpreting any type of distance matrix. It is fairly widely used for building phylogenetic trees, though it's application in phylogenetics is usually restricted to building preliminary trees to "guide" the process of multiple sequence alignment. The reason for this is that it's fast, but it makes some assumptions that don't work well for inferring relationships between organisms, which we'll discuss after working through the algorithm.
UPGMA starts with a distance matrix, and works through the following steps to create a tree.
Step 1: Find the smallest non-zero distance in the matrix and define a clade containing only those members. Draw that clade, and set the total length of the branch connecting the tips to the distance between the tips. The distance between each tip and the node connecting them should be half of the distance between the tips.
Step 2: Create a new distance matrix with an entry representing the new clade created in step 1.
Step 3: Calculate the distance matrix entries for the new clade as the mean distance from each of the tips of the new clade to all other tips in the original distance matrix.
Step 4: If there is only one distance (below or above the diagonal) in the distance matrix, use it to connect the remaining unconnected clades, and stop. Otherwise repeat step 1.
Let's work through these steps for a simple distance matrix representing the distances between five sequences.
_data = np.array([[ 0., 4., 2., 5., 6.],
[ 4., 0., 3., 6., 5.],
[ 2., 3., 0., 3., 4.],
[ 5., 6., 3., 0., 1.],
[ 6., 5., 4., 1., 0.]])
_ids = ['s1', 's2', 's3', 's4', 's5']
master_upgma_dm = skbio.DistanceMatrix(_data, _ids)
print(master_upgma_dm)
Step 1: The smallest non-zero distance in the above matrix is between s4 and s5. So, we'll draw that clade and set each branch length to half of the distance between them.

Step 2: Next, we'll create a new, smaller distance matrix where the sequences s4 and s5 are now represented by a single clade which we'll call (s4, s5). This notation indicates that the corresponding distances are to both s4 and s5.
iter1_ids = ['s1', 's2', 's3', '(s4, s5)']
iter1_dm = [[0.0, 4.0, 2.0, None],
[4.0, 0.0, 3.0, None],
[2.0, 3.0, 0.0, None],
[None, None, None, None]]
Step 3: We'll now fill in the values from the new clade to each of the existing sequences (or clades). The distance will be the mean between each pre-existing clade, and each of the sequences in the new clade. For example, the distance between s1 and (s4, s5) is the mean of the distance between s1 and s4 and s1 and s5:
import numpy as np
# Note: we are not using the iter1_dm, but the original matrix (master_upgma_dm)
s1_s4s5 = np.mean([master_upgma_dm['s1', 's4'], master_upgma_dm['s1', 's5']])
print(s1_s4s5)
Similarly, the distance between s2 and (s4, s5) is the mean of the distance between s2 and s4 and s2 and s5:
s2_s4s5 = np.mean([master_upgma_dm['s2', 's4'], master_upgma_dm['s2', 's5']])
print(s2_s4s5)
And finally, the distance between s3 and (s4, s5) is the mean of the distance between s3 and s4 and the distance between s3 and s5:
s3_s4s5 = np.mean([master_upgma_dm['s3', 's4'], master_upgma_dm['s3', 's5']])
print(s3_s4s5)
We can fill these values in to our iteration 1 distance matrix. Why do we only need to compute three values to fill in seven cells in this distance matrix?
iter1_dm = [[0.0, 4.0, 2.0, s1_s4s5],
[4.0, 0.0, 3.0, s2_s4s5],
[2.0, 3.0, 0.0, s3_s4s5],
[s1_s4s5, s2_s4s5, s3_s4s5, 0.0]]
iter1_dm = DistanceMatrix(iter1_dm, iter1_ids)
print(iter1_dm)
Step 4: Because there is still more than one value below the diagonal in our new distance matrix, we start a new iteration by going back to Step 1 and repeating this process.
Step 1: The smallest non-zero distance in the iteration 1 distance matrix is between s1 and s3. So, we'll draw that clade and set each branch length to half of that distance.

Step 2: We next create a new, smaller distance matrix where the sequences s1 and s3 are now represented by a single clade, (s1, s3).
iter2_ids = ['(s1, s3)', 's2', '(s4, s5)']
iter2_dm = [[None, None, None],
[None, 0.0, 5.5],
[None, 5.5, 0.0]]
Step 3: We'll now fill in the values from the new clade to each of the existing sequences (or clades). Notice that the distance between our new clade and s2 is the mean of two values, but the distance between our new clade and the clade defined in iteration 1 is the mean of four values. Why is that?
# For combining on s2 (index 1)
s2_s1s3 = np.mean([master_upgma_dm[1][0], master_upgma_dm[1][2]])
# For combining on s4 and s5 (referencing original master index)
s3s4_s1s3 = np.mean([master_upgma_dm[0][3], master_upgma_dm[0][4], master_upgma_dm[2][3], master_upgma_dm[2][4]])
We can now fill in all of the distances in our iteration 2 distance matrix.
iter2_dm = [[0.0, s2_s1s3, s3s4_s1s3],
[s2_s1s3, 0.0, 5.5],
[s3s4_s1s3, 5.5, 0.0]]
iter2_dm = DistanceMatrix(iter2_dm, iter2_ids)
print(iter2_dm)
Step 4: There is still more than one value below the diagonal, so we start a new iteration by again repeating the process.
Step 1: The smallest non-zero distance in the above matrix is now between (s1, s3) and s2. So, we'll draw that clade and set each branch length to half of the distance.

Step 2: We'll next create a new distance matrix where the clade (s1, s3) and the sequence s2 are now represented by a single clade, ((s1, s3), s2).
iter3_ids = ['((s1, s3), s2)', '(s4, s5)']
iter3_dm = [[None, None],
[None, 0.0]]
Step 3: We'll now fill in the values from the new clade to each of the existing sequences (or clades). This is now the mean of six distances. Why?
s1s2s3_s4s5 = np.mean([master_upgma_dm[0][3], master_upgma_dm[0][4],
master_upgma_dm[2][3], master_upgma_dm[2][4],
master_upgma_dm[1][3], master_upgma_dm[1][4]])
We fill this value into our iteration 3 distance matrix.
iter3_dm = [[0.0, s1s2s3_s4s5],
[s1s2s3_s4s5, 0.0]]
iter3_dm = DistanceMatrix(iter3_dm, iter3_ids)
print(iter3_dm)
Step 4: At this stage, there is only one distance below the diagonal in our distance matrix. So, we can use that distance to draw the final branch. This will connect our two deepest clades, ((s1, s3), s2) and (s4, s5), which will give us our final UPGMA tree.

SciPy contains an implementation of UPGMA that we can apply to our existing distance matrices, and we can then visualize the resulting trees with ete3. IAB provides a wrapper function that will give this an interface that is convenient to work with. If you'd like to see what the wrapper function is doing, using the psource IPython magic function as we have in other places in the text.
from iab.algorithms import tree_from_distance_matrix, guide_tree_from_sequences
%psource tree_from_distance_matrix
#%psource guide_tree_from_sequences
from scipy.cluster.hierarchy import average
help(average)
Let's compute and visualize UPGMA trees for the two distance matrices that we created above. How do these trees compare to one another? Does one look more or less correct than the other (they may or may not, depending on the random sample of sequences that are being compared).
One thing to be aware of as you start visualizing trees is that the vertical order (in the default TreeStyle being used here) doesn't have biological meaning, it's purely a visualization component. You can rotate the branches descending from any node in the tree freely.
kmer_tree = tree_from_distance_matrix(kmer_dm,
metric='upgma')
print(type(kmer_tree))
ete3.Tree(str(kmer_tree), format=1).render("%%inline", tree_style=ts)
jc_corrected_hamming_tree = tree_from_distance_matrix(jc_correct_dm(hamming_dm),
metric='upgma')
ete3.Tree(str(jc_corrected_hamming_tree), format=1).render("%%inline", tree_style=ts)
Which is better?
Which produces a higher percentage of agreement between leaf pairs?
As mentioned above, UPGMA has a rather complex sounding name: Unweighted Pair Group metric with Arithmetic mean.
The simple neighbor-joining method produces unrooted trees, but it does not assume a constant rate of evolution (i.e., a molecular clock) across lineages.
http://scikit-bio.org/docs/latest/generated/skbio.tree.nj.html
nj_tree = tree_from_distance_matrix(jc_correct_dm(hamming_dm),
metric='nj')
ete3.Tree(str(nj_tree), format=1).render("%%inline", tree_style=ts)
The following is a rooted tree, which means that it includes an assumption about the last common ancestor of all sequences represented in the tree.
https://en.wikipedia.org/wiki/Phylogenetic_tree#Rooted_tree
https://en.wikipedia.org/wiki/Tree_(graph_theory)#Rooted_tree
The most common method for rooting trees is the use of an uncontroversial outgroup close enough to allow inference from trait data or molecular sequencing, but far enough to be a clear outgroup.

By contrast, inferring the root of an unrooted tree requires some means of identifying ancestry. This is normally done by including an outgroup in the input data so that the root is necessarily between the outgroup and the rest of the taxa in the tree, or by introducing additional assumptions about the relative rates of evolution on each branch, such as an application of the molecular clock hypothesis.

Tree of life showing vertical and horizontal gene transfers:
Phylogenetic network of HVS-I variation in mitochondrial DNA haplogroup R0 and R0b
Improving simulation assumptions
# Do coding here,
# snap screenshots of code/plot,
# and write a paragraph in Canvas,
# with screenshots embedded in the text entry box
# little helper function
def mean_distance(D):
sum = 0
for i in range(len(D)):
for j in range(len(D)):
if i != j:
sum += D[i][j]
return sum/(len(D)*(len(D)-1))
sequences = []
sequences_aligned = []
kmer_av_distances = []
hamming_av_distances = []
import pandas as pd
print(hamming_av_distances)
print(kmer_av_distances)
# entries go up by 50 sequences
def mean_line_graph(hamming_dists, kmer_dists):
cols = []
for i in range(len(hamming_dists)):
val = (i+1)*50
cols.append(str('%d seqs'%(val)))
df = pd.DataFrame({
'hamming': hamming_dists,
'kmer': kmer_dists
},index=cols)
df.plot.line()
def create_evolutions(seq_count, indel_prob):
# create a data set
sequence = random_sequence(skbio.DNA, seq_count)
sequences = evolve_generations(sequence, generations=3,
substitution_probability=0.1,
indel_probability=indel_prob,
increased_rate_probability=0.1,
verbose=False)
return sequences
def create_kmer_dm(sequences, distances, length=50):
kmer_dm = DistanceMatrix.from_iterable(sequences,
metric=kmer_distance,
key='id')
_title = '3mer distances between sequences length %d' %(length)
_ = kmer_dm.plot(cmap='Greens',
title=_title)
df = kmer_dm.to_data_frame()
df = df.values.tolist()
distances.append(mean_distance(df))
return kmer_dm
def create_hamming_dm(sequences, distances, length=50):
if indel_probability == 0:
sequences_aligned = sequences
else:
gpa = partial(global_pairwise_align_nucleotide,
penalize_terminal_gaps=True)
sequences_aligned = progressive_msa(sequences,
pairwise_aligner=gpa)
hamming_dm = DistanceMatrix.from_iterable(sequences_aligned,
metric=hamming,
key='id')
_title='3mer distances between sequences length %d'%(length)
_ = hamming_dm.plot(cmap='Greens',
title=_title)
df = hamming_dm.to_data_frame()
df = df.values.tolist()
hamming_av_distances.append(mean_distance(df))
return hamming_dm
def create_kmer_tree(kmer_dm):
kmer_tree = tree_from_distance_matrix(jc_correct_dm(kmer_dm),
metric='nj')
ete3.Tree(str(kmer_tree), format=1).render("%%inline", tree_style=ts)
def create_hamming_tree(kmer_dm):
kmer_tree = tree_from_distance_matrix(jc_correct_dm(kmer_dm),
metric='nj')
ete3.Tree(str(kmer_tree), format=1).render("%%inline", tree_style=ts)
def create_both_trees(kmer,hamming):
create_kmer_tree(kmer)
create_hamming_tree(hamming)
def run_all_graphs(iter_counts, indel_prob, kmer_av, hamming_av):
for seq_count in iter_counts:
seqs = create_evolutions(seq_count, indel_prob)
hdm = create_hamming_dm(seqs, hamming_av, seq_count)
kdm = create_kmer_dm(seqs, kmer_av, seq_count)
create_both_trees(kdm,hdm)
hamming_av_distances = []
kmer_av_distances = []
seqs = []
for i in range(8):
seqs.append((i+1)*50)
run_all_graphs(seqs, 0, kmer_av_distances, hamming_av_distances)
mean_line_graph(hamming_av_distances, kmer_av_distances)